Skip to content

fix(duplication): prevent potential crash when removing or pausing duplication#2368

Open
lengyuexuexuan wants to merge 5 commits intoapache:masterfrom
lengyuexuexuan:fix_duplication
Open

fix(duplication): prevent potential crash when removing or pausing duplication#2368
lengyuexuexuan wants to merge 5 commits intoapache:masterfrom
lengyuexuexuan:fix_duplication

Conversation

@lengyuexuexuan
Copy link
Collaborator

What problem does this PR solve?

#2211

What is changed and how does it work?

  1. Background
    During duplication of a table, if the commands dup remove/pause are executed or a balance operation is performed at the same time, there is a chance that a node may core dump with signal ID 11. The core dump locations vary, but they all have one thing in common: they occur during memory allocation or deallocation.

  2. Analysis
    Based on extensive testing, the following conclusions can be drawn:
    a. The issue only reproduces when there is write traffic. The difference between having and not having write traffic is: It adds the ship and load_private_log tasks.
    b. The core dump occurs during the execution of cancel_all().
    c. The issue occurs with low probability (approximately 1 in 100).

    Through analysis using ASAN (AddressSanitizer):
    dup_remove_asan.txt
    Based on ASAN analysis, the following conclusions can be drawn:
    a. The memory corruption occurs during the ship process. The mutations obtained from replaying the plog are passed to ship, leading to the issue.
    b. _load_mutations is captured by a lambda expression and then passed to a std::function. Since std::move is used, the lifetime of _load_mutations is tied to that of the std::function.
    c. The cancel_all() function is executed in the default thread pool. At this point, the following function is called. When the std::function is set to nullptr, it will release the memory it manages.

    void clear_non_trivial_on_task_end() override { _cb = nullptr; }

    d. However, each task executes exec_internal() in its own thread pool, and eventually calls release_ref(), which results in delete this.
    this->release_ref(); // added in enqueue(pool)

  3. Conclusion
    1. Both task.cancel() and task.exec_internal() destruct the std::function member inside the task object. These two operations are executed in different threads, and there is no mechanism in place to prevent race conditions between them. As a result, it is possible for both threads to attempt to destruct the same std::function, which can lead to a double deletion of the memory associated with _load_mutations. This ultimately causes memory corruption.
    2. _duplications is accessed without proper synchronization in certain functions under multi-threaded scenarios, potentially causing race conditions.

  4. Solution

    1. Lock the _cb callback to ensure that only one thread executes its destructor.
    2. Add locking to functions that access _duplications without synchronization to prevent concurrent access conflicts.
Tests
  • Manual test
    The changes have been production-validated at Xiaomi, running stably on more than 30 clusters for over six months, confirming that they resolve the concurrency issues described above.

@github-actions github-actions bot added the cpp label Feb 12, 2026
@empiredan
Copy link
Contributor

Hi @lengyuexuexuan Thank you for your contribution!

Please modify the code according to the suggestions provided by Clang Tidy and IWYU.

@lengyuexuexuan
Copy link
Collaborator Author

Hi @lengyuexuexuan Thank you for your contribution!

Please modify the code according to the suggestions provided by Clang Tidy and IWYU.

done

return;
}

zauto_lock l(_lock);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does _primary_confirmed_decree below no longer need protection by _lock?

Copy link

@acelyc111-bot acelyc111-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: Prevent potential crash when removing/pausing duplication

Summary: Fixes race conditions in replica_duplicator_manager by adjusting lock placement. Also adds thread safety to raw_task callback clearing.

What's good:

  • Moving lock acquisition earlier in sync_duplication prevents accessing shared state before protection
  • Removing the redundant lock in update_confirmed_decree_if_secondary is correct since remove_all_duplications now acquires the lock itself
  • The raw_task destructor + clear_non_trivial_on_task_end lock prevents use-after-free on the callback during concurrent task completion

Concerns:

  1. Lock scope change in sync_duplication — The lock now covers the entire function including the early return path and all the logic below. Previously only the mutation of _duplications was locked. This is correct for safety but increases the lock hold time. If sync_duplication is called frequently or does heavy work, this could cause contention. Worth checking if there are any blocking calls inside.

  2. Double lock possibilityupdate_confirmed_decree_if_secondary calls remove_all_duplications() which now acquires _lock. If any caller of update_confirmed_decree_if_secondary already holds _lock, this would deadlock. Please verify there are no such call chains.

  3. raw_task lock naming_lock is very generic. Consider _cb_lock or _callback_lock for clarity.

Verdict: ⚠️ Request Changes — Please verify no double-locking path exists for update_confirmed_decree_if_secondaryremove_all_duplications.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants